WEEK 6: VERSION CONTROL

Tuesday, February 13th

Today we will…

  • Review
    • PA 5.2: Scrambled Message
    • Lab 4 Essay: Center vs Family Care
    • Lab 5: Murder in SQL City
  • Group Contract
  • New Material
    • git/GitHub
    • Connect GitHub to RStudio
  • PA 6: Merge Conflicts – Collaborating within a GitHub Repo

Childcare Essays: Data Cleaning

Code
ca_childcare <- counties |> 
  filter(state_abbreviation == "CA") |> 
  left_join(childcare_costs,
            by = "county_fips_code"
            ) |> 
  mutate(region = fct_collapse(county_name,
                               `Superior California`         = c("Butte County", "Colusa County", "El Dorado County", "Glenn County", "Lassen County", "Modoc County", "Nevada County", "Placer County", "Plumas County", "Sacramento County", "Shasta County", "Sierra County", "Siskiyou County", "Sutter County", "Tehama County", "Yolo County", "Yuba County"),
                               `North Coast`                 = c("Del Norte County", "Humboldt County", "Lake County", "Mendocino County", "Napa County", "Sonoma County", "Trinity County"),
                               `San Francisco Bay Area`      = c("Alameda County", "Contra Costa County", "Marin County", "San Francisco County", "San Mateo County", "Santa Clara County", "Solano County"),
                               `Northern San Joaquin Valley` = c("Alpine County", "Amador County", "Calaveras County", "Madera County", "Mariposa County", "Merced County", "Mono County", "San Joaquin County", "Stanislaus County", "Tuolumne County"),
                               `Central Coast`               = c("Monterey County", "San Benito County", "San Luis Obispo County", "Santa Barbara County", "Santa Cruz County", "Ventura County"),
                               `Southern San Joaquin Valley` = c("Fresno County", "Inyo County", "Kern County", "Kings County", "Tulare County"),
                               `Inland Empire`               = c("Riverside County", "San Bernardino County"),
                               `Los Angeles County`          = c("Los Angeles County"),
                               `Orange County`               = c("Orange County"),
                               `San Diego - Imperial`        = c("Imperial County", "San Diego County")
                                )
         )|> 
  select(county_name, region, study_year, mc_infant:mfcc_preschool)
ca_childcare
# A tibble: 638 × 9
   county_name   region study_year mc_infant mc_toddler mc_preschool mfcc_infant
   <chr>         <fct>       <dbl>     <dbl>      <dbl>        <dbl>       <dbl>
 1 Alameda Coun… San F…       2008      302.       214.         214.        192.
 2 Alameda Coun… San F…       2009      313.       234.         234.        200.
 3 Alameda Coun… San F…       2010      313.       235.         235.        201.
 4 Alameda Coun… San F…       2011      314.       236.         236.        203.
 5 Alameda Coun… San F…       2012      314.       236.         236.        204 
 6 Alameda Coun… San F…       2013      312.       232.         232.        204.
 7 Alameda Coun… San F…       2014      310.       227.         227.        204.
 8 Alameda Coun… San F…       2015      344.       251.         251.        228.
 9 Alameda Coun… San F…       2016      378.       274.         274.        252.
10 Alameda Coun… San F…       2017      386.       290.         290.        271.
# ℹ 628 more rows
# ℹ 2 more variables: mfcc_toddler <dbl>, mfcc_preschool <dbl>
Code
ca_childcare |> 
  pivot_longer(cols = mc_infant:mfcc_preschool,
               names_to = "group", 
               values_to = "median_price"
               ) |> 
  separate_wider_delim(group,
                       names = c("care", "development"),
                       delim = "_"
                       ) |> 
  mutate(care = fct_recode(care,
                           "Center" = "mc",
                           "Family" = "mfcc"
                           ),
         development = fct_relevel(development,
                                   "infant", "toddler", "preschool"
                                   )
         )
# A tibble: 3,828 × 6
   county_name    region               study_year care  development median_price
   <chr>          <fct>                     <dbl> <fct> <fct>              <dbl>
 1 Alameda County San Francisco Bay A…       2008 Cent… infant              302.
 2 Alameda County San Francisco Bay A…       2008 Cent… toddler             214.
 3 Alameda County San Francisco Bay A…       2008 Cent… preschool           214.
 4 Alameda County San Francisco Bay A…       2008 Fami… infant              192.
 5 Alameda County San Francisco Bay A…       2008 Fami… toddler             178.
 6 Alameda County San Francisco Bay A…       2008 Fami… preschool           178.
 7 Alameda County San Francisco Bay A…       2009 Cent… infant              313.
 8 Alameda County San Francisco Bay A…       2009 Cent… toddler             234.
 9 Alameda County San Francisco Bay A…       2009 Cent… preschool           234.
10 Alameda County San Francisco Bay A…       2009 Fami… infant              200.
# ℹ 3,818 more rows
Code
ca_childcare <- ca_childcare |> 
  pivot_longer(cols = mc_infant:mfcc_preschool,
               names_to = c("care", "development"),
               names_sep = "_",
               values_to = "median_price"
               ) |> 
  mutate(care = fct_recode(care,
                           "Center" = "mc",
                           "Family" = "mfcc"
                           ),
         development = fct_relevel(development,
                                   "infant", "toddler", "preschool"
                                   )
         )
ca_childcare
# A tibble: 3,828 × 6
   county_name    region               study_year care  development median_price
   <chr>          <fct>                     <dbl> <fct> <fct>              <dbl>
 1 Alameda County San Francisco Bay A…       2008 Cent… infant              302.
 2 Alameda County San Francisco Bay A…       2008 Cent… toddler             214.
 3 Alameda County San Francisco Bay A…       2008 Cent… preschool           214.
 4 Alameda County San Francisco Bay A…       2008 Fami… infant              192.
 5 Alameda County San Francisco Bay A…       2008 Fami… toddler             178.
 6 Alameda County San Francisco Bay A…       2008 Fami… preschool           178.
 7 Alameda County San Francisco Bay A…       2009 Cent… infant              313.
 8 Alameda County San Francisco Bay A…       2009 Cent… toddler             234.
 9 Alameda County San Francisco Bay A…       2009 Cent… preschool           234.
10 Alameda County San Francisco Bay A…       2009 Fami… infant              200.
# ℹ 3,818 more rows
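
The two reshaping chunks above are meant to produce the same table: one splits the column names after pivoting, the other lets `pivot_longer()` split them with `names_sep`. A minimal sketch of that equivalence follows. The `toy` data frame and its values are invented for illustration, and `separate_wider_delim()` assumes tidyr 1.3.0 or later.

```r
library(tidyr)

# Made-up data that follows the same column-naming pattern as ca_childcare.
toy <- data.frame(
  county_name = c("County A", "County B"),
  mc_infant   = c(300, 250),
  mfcc_infant = c(190, 160)
)

# Two steps: pivot into a single name column, then split that column on "_".
two_step <- toy |>
  pivot_longer(cols = mc_infant:mfcc_infant,
               names_to = "group",
               values_to = "median_price") |>
  separate_wider_delim(group, delim = "_", names = c("care", "development"))

# One step: pivot_longer() splits the old column names itself.
one_step <- toy |>
  pivot_longer(cols = mc_infant:mfcc_infant,
               names_to = c("care", "development"),
               names_sep = "_",
               values_to = "median_price")

# Compare the two results.
all.equal(two_step, one_step)
```

The one-step version is shorter, but the two-step version can be easier to read while you are still working out how the names should be split.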

Childcare Essays: Gestalt Principles (Table Design)

Code
ca_childcare |> 
  group_by(care, development) |> 
  summarize(median = median(median_price)) |> 
  knitr::kable(digits = 2)
care development median
Center infant 258.14
Center toddler 182.36
Center preschool 182.36
Family infant 166.40
Family toddler 152.72
Family preschool 152.72
Code
ca_childcare |> 
  group_by(care, development) |> 
  summarize(median = median(median_price),
            IQR = IQR(median_price)
            ) |> 
  arrange(development) |> 
  knitr::kable(digits = 2)
care development median IQR
Center infant 258.14 54.39
Family infant 166.40 40.37
Center toddler 182.36 42.24
Family toddler 152.72 35.18
Center preschool 182.36 42.24
Family preschool 152.72 35.18
Code
ca_childcare |> 
  group_by(care, development) |> 
  summarize(median = median(median_price)
            ) |> 
  pivot_wider(id_cols = development,
              names_from = care,
              values_from = median
              ) |> 
  knitr::kable(digits = 2)
development Center Family
infant 258.14 166.40
toddler 182.36 152.72
preschool 182.36 152.72

Childcare Essays: Gestalt Principles (Graph Design)

Code
ca_childcare |> 
  ggplot(aes(x = development,
             y = median_price
             )
         ) +
  geom_boxplot() +
  facet_grid(~ care) +
  labs(x = "Development Stage",
       y = "",
       subtitle = "Median Weekly Price of Childcare ($)"
       )

Code
ca_childcare |> 
  ggplot(aes(x = care,
             y = median_price
             )
         ) +
  geom_boxplot() +
  facet_grid(~ development) +
  labs(x = "Care Setting",
       y = "",
       subtitle = "Median Weekly Price of Childcare ($)"
       )

Childcare Essays: Alternative Plots

Code
ca_childcare |> 
  mutate(care = fct_reorder2(.f = care,
                             .x = study_year,
                             .y = median_price
                             )
         ) |> 
  ggplot(aes(x = study_year,
             y = median_price,
             color = care,
             shape = care
             )
         ) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  facet_grid(~ development) +
  scale_color_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = seq(2008, 2018, 3)) +
  scale_y_continuous(limits = c(100, 500), breaks = seq(0, 500, 50)) +
  theme_bw() +
  labs(x = "Year",
       y = "",
       subtitle = "Median Weekly Price of Childcare ($)",
       color = "Care Setting",
       shape = "Care Setting"
       )

Code
ca_childcare_wide <- ca_childcare |> 
  group_by(region, care, development) |> 
  summarize(median = median(median_price)) |> 
  pivot_wider(id_cols     = c(region, development),
              names_from  = care,
              values_from = median
              )
ca_childcare_wide[1:2,]
# A tibble: 2 × 4
# Groups:   region [1]
  region                 development Center Family
  <fct>                  <fct>        <dbl>  <dbl>
1 San Francisco Bay Area infant        344.   224.
2 San Francisco Bay Area toddler       254    206.
Code
ca_childcare |> 
  group_by(region, care, development) |> 
  summarize(median = median(median_price)) |> 
  ggplot() +
  geom_segment(data = ca_childcare_wide,
               aes(x = Center,
                   xend = Family,
                   y = region,
                   yend = region),
               color = "darkgray") +
  geom_point(aes(x = median,
             y = region,
             color = care,
             shape = care
             ),
             size = 2
             ) +
  facet_grid(~ development) +
  scale_color_brewer(palette = "Dark2") +
  scale_x_continuous(limits = c(100, 500), breaks = seq(0, 500, 100)) +
  theme_bw() +
  labs(x = "Median Price ($)",
       y = "",
       subtitle = "Median Weekly Price of Childcare ($)",
       color = "Care Setting",
       shape = "Care Setting"
       )

Childcare Essays: Examples

git/GitHub Basics

Git vs GitHub


Git

  • Version control system
  • Developed by Linus Torvalds (creator of Linux, the kernel behind Android and Chrome OS)
  • Uses command line or GUI

GitHub

  • Cloud-based hosting service for git repositories
  • Basic services are free
  • Advanced services are paid (similar to RStudio)

Why GitHub?

  1. A structured way for tracking changes to files over the course of a project.

  2. Makes it easy to have multiple people working on the same files at the same time.

  3. You can host a URL of fun things (like the class text, these slides, a personal website, etc.) with GitHub pages.

Think “track-changes” or “Dropbox” history, but more structured.

Git Repositories

  • Think of this as a folder-directory for a single project (like your stat-331 folder!)

  • You may have code, documentation, data, TODO lists, and more associated with a project.

  • To create a repository, you can start with your local computer first, or you can start with the remote (online) repository first.

Actions in Git

Cloning a Repo

Clone = create an exact copy locally

Committing

Git tracks changes to each file that it is told to monitor, and as the files change, you provide short labels describing what the changes were and why they exist (called “commits”).

Here, we commit the red line as a change to our file.

The log of these changes (along with the file history) is called your git commit history. This means you can always go back to old copies!
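
A minimal sketch of committing and then reading the commit history, run from R in a throwaway folder. It assumes the `git` command-line tool is installed. The `git()` helper, the file name, and the name/email values are made up for this illustration. In class we will use the RStudio Git pane instead of typing these commands.

```r
# Hypothetical helper: run `git <args>` inside `dir` and capture what it prints.
git <- function(dir, ...) {
  suppressWarnings(
    system2("git", c("-C", shQuote(dir), ...), stdout = TRUE, stderr = TRUE)
  )
}

repo <- tempfile("commit-demo-")   # throwaway folder, not your real project
dir.create(repo)
git(repo, "init", "--quiet")

# Identity for this demo repo only; the values are placeholders.
git(repo, "config", "user.name", shQuote("Jane Doe"))
git(repo, "config", "user.email", "jane@example.org")

# First commit: create a file, stage it, and label the change.
notes <- file.path(repo, "notes.txt")
writeLines("First draft", notes)
git(repo, "add", "notes.txt")
git(repo, "commit", "--quiet", "-m", shQuote("Add first draft of notes"))

# Second commit: change the file, stage it, and label the change.
writeLines(c("First draft", "A new line"), notes)
git(repo, "add", "notes.txt")
git(repo, "commit", "--quiet", "-m", shQuote("Add a second line to notes"))

# The commit history: one line per commit, newest first.
history <- git(repo, "log", "--oneline")
history
```

Each line of `history` pairs a short commit ID with the message you wrote, which is why informative messages pay off later.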

Pushing

Updates the copy of the repository on another machine (e.g. on GitHub) so that it has the most recent changes you’ve made on your machine.

Pulling

Updates your local copy of the repository (the copy on your computer) with the files that are “in the cloud” (on GitHub).
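
A rough sketch of pushing and pulling, assuming the `git` command-line tool is installed. A bare repository in a temporary folder stands in for GitHub. The `git()` helper, folder names, and identity values are invented for this illustration.

```r
# Hypothetical helper: run `git <args>` inside `dir` and capture what it prints.
git <- function(dir, ...) {
  suppressWarnings(
    system2("git", c("-C", shQuote(dir), ...), stdout = TRUE, stderr = TRUE)
  )
}

base   <- tempfile("push-pull-demo-")
remote <- file.path(base, "remote.git")  # stands in for GitHub
mine   <- file.path(base, "mine")        # "my" computer
theirs <- file.path(base, "theirs")      # a collaborator's computer
dir.create(remote, recursive = TRUE)

git(remote, "init", "--bare", "--quiet")
git(base, "clone", "--quiet", shQuote(remote), shQuote(mine))
git(mine, "config", "user.name", shQuote("Jane Doe"))
git(mine, "config", "user.email", "jane@example.org")

# Commit locally, then push so the remote has the commit too.
writeLines("Hello from Jane", file.path(mine, "notes.txt"))
git(mine, "add", "notes.txt")
git(mine, "commit", "--quiet", "-m", shQuote("Add notes"))
git(mine, "push", "--quiet", "-u", "origin", "HEAD")

# The collaborator clones; afterwards Jane pushes one more change.
git(base, "clone", "--quiet", shQuote(remote), shQuote(theirs))
writeLines(c("Hello from Jane", "A second line"), file.path(mine, "notes.txt"))
git(mine, "commit", "--quiet", "-am", shQuote("Add a second line"))
git(mine, "push", "--quiet")

# Until the collaborator pulls, their copy is out of date.
before <- readLines(file.path(theirs, "notes.txt"))
git(theirs, "pull", "--quiet")
after  <- readLines(file.path(theirs, "notes.txt"))
```

Here `before` still holds the old one-line file, and `after` matches Jane's latest version: pushing updated the remote, and pulling updated the collaborator's local copy.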

Pushing and Pulling

Merge Conflicts

Occur when you make changes to the same line as a collaborator either at the same time, or without starting from the same “state”.

  1. Maybe you are working in real time on the same line of code or text.
  2. Maybe you forgot to push your changes last time you finished working.
  3. Maybe you forgot to pull your collaborator’s changes before you started working this time.
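
The third scenario can be reproduced in a throwaway folder. This is a sketch only, assuming the `git` command-line tool is installed. The `git()` helper, the two "people", and the file contents are made up. It deliberately stops at detecting the conflict, so you can see the conflict markers for yourself during the practice activity.

```r
# Hypothetical helper: run `git <args>` inside `dir` and capture what it prints.
git <- function(dir, ...) {
  suppressWarnings(
    system2("git", c("-C", shQuote(dir), ...), stdout = TRUE, stderr = TRUE)
  )
}

base   <- tempfile("conflict-demo-")
remote <- file.path(base, "remote.git")  # stands in for GitHub
a      <- file.path(base, "person-a")
b      <- file.path(base, "person-b")
dir.create(remote, recursive = TRUE)
git(remote, "init", "--bare", "--quiet")

# Person A creates the file and pushes it.
git(base, "clone", "--quiet", shQuote(remote), shQuote(a))
git(a, "config", "user.name", shQuote("Person A"))
git(a, "config", "user.email", "a@example.org")
writeLines("author: TBD", file.path(a, "doc.txt"))
git(a, "add", "doc.txt")
git(a, "commit", "--quiet", "-m", shQuote("Start document"))
git(a, "push", "--quiet", "-u", "origin", "HEAD")

# Person B clones, so both people start from the same state.
git(base, "clone", "--quiet", shQuote(remote), shQuote(b))
git(b, "config", "user.name", shQuote("Person B"))
git(b, "config", "user.email", "b@example.org")

# Both edit the SAME line. A pushes first; B commits without pulling.
writeLines("author: first names", file.path(a, "doc.txt"))
git(a, "commit", "--quiet", "-am", shQuote("Add first names"))
git(a, "push", "--quiet")
writeLines("author: first and last names", file.path(b, "doc.txt"))
git(b, "commit", "--quiet", "-am", shQuote("Add first and last names"))

# B's pull must combine two different versions of one line, and cannot.
git(b, "pull", "--no-rebase")
conflicted <- git(b, "diff", "--name-only", "--diff-filter=U")
conflicted
```

`conflicted` lists the files Git could not merge automatically. If Person B had edited a different line instead, the same pull would merge cleanly.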

Workflow

Starting a new project/local repo

  1. Clone the project or create a new repository
  2. Make some changes
  3. Commit the changes
  4. Pull any changes from the remote repository
  5. Resolve any merge conflicts
  6. Push the changes (and merged files)

Working with an existing local repo

  1. Pull the repo (especially if collaborating)
  2. Make some changes
  3. Commit the changes
  4. Pull any changes from the remote repository (again!)
  5. Resolve any merge conflicts
  6. Push the changes (and merged files)

Connect GitHub to RStudio

R packages we will need

Work in your console or an R script (.R) for this…

  1. Install and load the {usethis} R package
install.packages("usethis")
library(usethis)
  2. Install and load the {gitcreds} R package
install.packages("gitcreds")
library(gitcreds)

Configure git

  1. Tell git your email and username.
use_git_config(user.name = "JaneDoe2", user.email = "jane@example.org")

Generate your PAT (Personal Access Token)

  1. Generate token
create_github_token()

Warning

GitHub really doesn’t like it when you do not have a PAT expiration date… but I don’t ever want to deal with it again. Make sure your expiration date is AT LEAST through the end of the quarter (60 days).

Store your PAT

  1. Copy your PAT

  2. Run gitcreds_set() and, at the “Enter password or token:” prompt, paste your PAT
gitcreds_set()

Verify PAT

You should be good to go! Let’s verify.

git_sitrep()

PA 6: Merge Conflicts

Collaborating within a GitHub Repo

Get into your groups!

  • See Canvas for your group number and members (Dr. Robinson also has the list).
  • Introduce yourself by sharing something about your name.
  • Exchange contact information.
  • Grab a set of numbered cards + red/blue/yellow sticky notes

Assign each person one of the suits – you will be referencing it as you work through this activity.

Warning

If you only have 3 group members here, assign one person two of the suits.

Flags

Please use sticky notes to indicate how your group is doing:

  • Blue

    – We are ready to move on
  • Yellow

    – We’re figuring it out
  • Red

    – Please help!!

Repository Setup

Creating a Repo (starting from GitHub)

  • Create a new GitHub repository – Repositories > New
    • Name the repository stat-331-PA6
    • You can choose Public or Private
    • Select .gitignore template: R
  • Click on the Settings tab > Click on Collaborators > Add people
  • Add your other group members to the repository using their username or email

Accessing the Repo (remote repo)

  • Verify repository invite in your email – View invite > Accept invite
  • Open the repository in GitHub – github.com/

Cloning the Repo (local repo)

  • In RStudio: File > New Project > Version Control > Git (pause)
  • In GitHub, copy the HTTPS address from the repo <> Code and paste into Repository URL in RStudio
  • Click Browse and navigate to where you want to create your new project. I recommend creating this on your desktop.

Caution

Do not save this within your master stat 331 folder!!! We don’t want to embed an RProject within an RProject.

  • Create Project

Collaborating in GitHub

Adding Documents to the Repo

  • Create a new Quarto file (using the standard template)
    • Title the document “Practice Activity 6”
    • Resist the urge to add author names
    • Save the document as PA6.qmd in your stat-331-PA6 desktop folder
    • Add self-contained: true to the YAML
    • Render the document
  • Edit the .gitignore file to include *.Rproj
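
If you would rather make the .gitignore edit from R than by hand, here is a minimal base-R sketch. The `add_to_gitignore()` helper is made up for this illustration, and it is demonstrated in a throwaway folder rather than your real project.

```r
# Hypothetical helper: add `pattern` to the .gitignore in `dir`, skipping duplicates.
add_to_gitignore <- function(pattern, dir = ".") {
  path <- file.path(dir, ".gitignore")
  existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character()
  if (!pattern %in% existing) {
    writeLines(c(existing, pattern), path)
  }
  invisible(readLines(path, warn = FALSE))
}

# Try it in a throwaway folder that already has a small .gitignore.
demo <- tempfile("gitignore-demo-")
dir.create(demo)
writeLines(c(".Rhistory", ".RData"), file.path(demo, ".gitignore"))

add_to_gitignore("*.Rproj", dir = demo)
add_to_gitignore("*.Rproj", dir = demo)   # second call changes nothing

ignored <- readLines(file.path(demo, ".gitignore"))
ignored
```

The {usethis} package also has `use_git_ignore()` for the same job. Either way, the pattern `*.Rproj` tells git to skip every file ending in .Rproj, so each group member keeps their own project file out of the shared repo.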

Pushing Documents to the Repo

  • Git pane > Commit > Stage (or checkmark) files > Commit message > Commit
    • Commit the .gitignore file with an explanatory message such as “Ignoring all .Rproj files in repo”
    • Commit the PA6.qmd and PA6.html files with an explanatory message such as “Created practice activity quarto file”
  • Push the changes to the remote repository

Pulling Changes from the Repo

  • Git pane > Pull the changes that were made!

Everyone should now have the .qmd and .html files in their local repos!

Making a Change

  • Add author: to the YAML and include everyone’s first names
  • Render the document
  • Git pane > Commit > Stage (or checkmark) files > Commit message > Commit
    • Commit the changes with a message such as “added group first names”
  • Push the changes

Pushing Changes & Not Pulling

Do not pull the changes that were made!

Making the Same Change

  • Edit author: in the YAML to include everyone’s first and last names.
  • Render the document
  • Git pane > Commit > Stage (or checkmark) files > Commit message > Commit
    • Commit the changes with a message such as “added group first and last names”
  • Push your changes

Caution

Oh no 😱 You got an error! Ugh. We forgot to pull before we started making changes 😢

Forgetting to Pull before you Push

Resolving Merge Conflicts

  • Pull the changes from the repo

  • Review the document with the merge conflict

Tip

Note how the conflicting lines are marked! You might need to submit this to Canvas 😄

  • Resolve the conflict with the preferred change
  • Commit your changes
  • Push the changes to the repository

Pushing Changes & Not Pulling

Do not pull the changes that were made!

Making Different Changes

  • Change the first code chunk to find the product of 13 times 13.
  • Render the document
  • Commit your changes
  • Push your changes

Warning

You will get an error. Read it, then pull.

No merge conflicts should occur. Now push your changes.

Auto Merge

Making Different Changes

Do not pull the changes that were made!

Making the Same Changes (Again)

  • Change the first code chunk to find the product of 11 times 11.
  • Render the document
  • Commit your changes
  • Push the changes to the repository

Caution

You will get an error. Ugh!!!! We forgot to pull again!

Making the Same Changes (Again)

  • Pull the changes from the repo
  • Review the document with the merge conflict
  • Clear the merge conflict by choosing the correct/preferred change
  • Commit your changes
  • Push your changes

Final Document

Pull, and observe the changes in your document.

Canvas Quiz Submission

Note

How does Git mark the start of lines with a merge conflict? Specifically, I want the four capital letter characters with which every conflict is marked.

Commit Tips

  • Use relatively short, but also informative commit messages.
  • Commit small blocks of changes. Work to commit every time you’ve accomplished a small task.
    • You’ll have small, bite-sized changes that are briefly described to serve as a record of what you’ve done (and what still needs doing)
    • When you mess up (or end up in a merge conflict) you will have a much easier time pinpointing the spot where things went bad, what code was there before, and (because you have nice, descriptive commit messages) how the error occurred.

Tips for avoiding merge conflicts

  • Always pull before you start working and always push after you are done working!

  • In general, if you follow the workflow for an existing local repo exactly, you only have problems if two of you are making local changes to the same line in the same file at the same time.

  • If you are working with collaborators in real time, pull, commit, and push often.

  • Git commits lines – lines of code, lines of text, etc.

    • Practice good code format and put each sentence on its own line.
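
A small base-R sketch of why one sentence per line helps. Because git compares files line by line, an edit to one sentence in a long paragraph marks the whole paragraph as changed. The `one_sentence_per_line()` helper and the example text are invented for this illustration.

```r
# Hypothetical helper: put each sentence of a paragraph on its own line.
one_sentence_per_line <- function(text) {
  strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
}

original <- "Git tracks lines. Short lines make small diffs. Small diffs are easy to merge."
edited   <- "Git tracks lines. Short lines make tiny diffs. Small diffs are easy to merge."

# Written as one long line, the whole paragraph counts as changed.
whole_line_changed <- original != edited

# Split by sentence, the same edit touches only one line.
old_lines <- one_sentence_per_line(original)
new_lines <- one_sentence_per_line(edited)
changed <- which(old_lines != new_lines)
changed
```

With the paragraph on one line, a collaborator editing any other sentence would collide with you. With one sentence per line, only edits to that same sentence can conflict.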

When all else fails…

Burn your local repo to the ground and clone again.

To do…

  • PA 6: Merge Conflicts
    • Due Thursday, 2/15 at 8:00am
  • Midterm Exam
    • Thursday, 2/15
    • Parts 1 & 2 completed during class, Part 3 completed 24 hours after end of class

Note

  • Office hours extended Wednesday from 1:10pm - 3pm.
  • No office hours Thursday.
  • Final Project Group Contract
    • Due Sunday, 2/25 at 11:59pm

Thursday, February 15th – Midterm Exam

  • I will pass out a sheet with instructions and questions.
  • Canvas will unlock online midterm material at the beginning of class.

You are fantastic!

To do…

  • Read Chapter 7: Writing Functions
    • Check-in 7.1 due Tuesday 2/20 at 8am
  • Final Project Group Contract
    • Due Sunday 2/25 at 11:59pm